Automatic summarization model based on clustering algorithm

Extractive document summarization is usually treated as a sequence labeling task in which the summary is composed of sentences taken from the original document. However, the selected sentences are often highly redundant in semantic space, so the resulting summary is semantically redundant as well. To alleviate this problem, we propose a model that reduces the semantic redundancy of the summary by introducing a clustering algorithm to select sentences that differ in semantic space, and we adapt the base BERT model to score sentences. We evaluate our model and perform significance testing using ROUGE on the CNN/DailyMail dataset, comparing it with six baselines that include two traditional methods and four state-of-the-art deep learning models. The results validate the effectiveness of our approach, which leverages the K-means algorithm to produce summaries that are more accurate and contain fewer semantically repeated sentences.

We evaluate the proposed EDS model and conduct a significance test using the ROUGE metrics on the CNN/DailyMail dataset to compare it with the baseline models.

Extractive document summary
The EDS task aims to extract sentences from the original document to form a summary. The task first encodes the sentences with an encoder to obtain word vectors, then passes these through a sentence encoding module to obtain a vector that represents each sentence. The sentence vector then passes through a classification layer that determines whether the sentence should be included in the summary. Nallapati et al. 2 and Zhou et al. 5 chose recurrent neural networks (RNNs) for sentence encoding, while Wang et al. 6 used the Transformer. BERT and other pretrained language models also perform well on the EDS task.
Although these methods are effective, most of them do not consider the semantic problem. To alleviate it, Mohd et al. 7 used distributional semantics to capture and preserve the semantics of the text as the fundamental feature for summarization; Kirmani et al. 8 applied bio-semantic models to the domain of biomedical research; Bhat et al. 9 used the emotion expressed in the text as a semantic feature; Kirmani et al. 10 proposed an email summarization system built on semantic models and deep learning; and Mud et al. 11 proposed an advanced text document summarizer that uses a clustering algorithm to preserve the underlying semantics of the original text. Although these methods aim to preserve the semantics of the original text, they do not address redundancy, so the sentences in the extracted summary may be semantically redundant.
Building on the extractive models and semantic summarizers above, we introduce a clustering algorithm that groups sentences by their semantics, which improves model performance and reduces the semantic redundancy of the summary.

Sentence embeddings for EDS
An important step in extractive document summarization is obtaining sentence embeddings. The traditional approach constructs a sentence vector as a weighted average of its word vectors. Kedzie et al. 12 averaged the word embeddings of a sentence to obtain its sentence embedding. This approach assumes that every word contributes equally to the sentence, regardless of the particular role of individual words. Nallapati et al. 2 applied an RNN that computes the hidden state at each word position in turn from the current word embedding and the previous hidden state, and then used the average of the concatenated hidden states as the sentence embedding. In contrast to simple word embedding averaging, the RNN approach takes word order into account.
We use the BERTSUM method to embed sentences; it combines token embeddings, segment embeddings and position embeddings to carry the information of each sentence into its sentence embedding.
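The averaging approach described above can be illustrated with a minimal sketch. The function name `mean_pool` and the toy two-dimensional vectors are assumptions made for illustration; they are not taken from Kedzie et al. or from our implementation.

```python
from typing import List


def mean_pool(word_vectors: List[List[float]]) -> List[float]:
    """Average word vectors element-wise to obtain one sentence vector."""
    if not word_vectors:
        raise ValueError("sentence has no word vectors")
    dim = len(word_vectors[0])
    count = len(word_vectors)
    return [sum(vec[d] for vec in word_vectors) / count for d in range(dim)]


# Toy word vectors with made-up values (hypothetical, for illustration only).
sentence = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
sentence_embedding = mean_pool(sentence)

# Reversing the word order gives the same vector, which is exactly the
# limitation that order-aware encoders such as RNNs address.
reversed_embedding = mean_pool(list(reversed(sentence)))
```

Because every word receives the same weight 1/len(sentence), this sketch also makes the equal-contribution assumption explicit.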

Method
The EDS task is defined as follows. A single document $D$ containing $n$ sentences is written $D = \{s_1, s_2, \ldots, s_n\}$, where $s_i = \{w_{i1}, w_{i2}, \ldots, w_{ij}\}$ is the $i$-th sentence in the document and $w_{ij}$ is the $j$-th word in the $i$-th sentence. EDS can be seen as a sequence labeling task, which means that every sentence in the document is assigned a label $y \in \{0, 1\}$ indicating whether the sentence should be included in the summary. We introduce a clustering algorithm into the EDS task and apply it to the sentence embeddings to reduce the number of semantically redundant sentences in the summary. In this section, we first introduce the BERTSUM 13 model, which is used to select sentences from the document, and then present our Cluster-EDS model.

Score-BERT
The BERT model is used as the basis for Score-BERT, which scores sentences. Score-BERT consists of two parts: an encoder and a sigmoid layer. In the encoder, BERT encodes the original text into semantic vector representations. The sigmoid layer then applies a fully connected linear layer followed by the sigmoid function to obtain a prediction score, which is used to select sentences. Score-BERT is shown in Fig. 1.
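The scoring head described above (a linear layer followed by a sigmoid) can be sketched as follows. This is a minimal sketch that assumes the sentence vectors have already been produced by the encoder; the names `score_sentences`, `weights` and `bias`, as well as the toy values, are hypothetical and do not come from our implementation.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def score_sentences(
    sentence_vectors: List[Sequence[float]],
    weights: Sequence[float],
    bias: float,
) -> List[float]:
    """Apply one linear unit plus sigmoid to each sentence vector."""
    scores = []
    for vec in sentence_vectors:
        logit = sum(w * v for w, v in zip(weights, vec)) + bias
        scores.append(sigmoid(logit))
    return scores


# Hypothetical encoder outputs and parameters (illustrative values only).
vectors = [[0.0, 0.0], [2.0, 1.0], [-3.0, 0.5]]
scores = score_sentences(vectors, weights=[1.0, 1.0], bias=0.0)
```

Each score lies in the interval (0, 1), so sentences can be ranked directly by score.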

Cluster-EDS model
Our model is based on Score-BERT. After Score-BERT selects n sentence vectors, the K-means algorithm maps the selected sentence vectors into a high-dimensional semantic space and divides the sentences into clusters according to their semantics. The sentences with the highest scores are then selected from different clusters to form the summary, so that the selected sentences, and hence the composed summary, have lower semantic redundancy. The model is shown in Fig. 2.
The K-means module receives the sentence embeddings and the scores from Score-BERT. It maps the sentence embeddings into the high-dimensional semantic space, and the model selects the highest-scoring sentences from different clusters, so the extracted summary is composed of sentences that have high scores and differ semantically.
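The selection step can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not our implementation: it uses Lloyd's algorithm with a deterministic farthest-point initialisation, takes exactly one sentence per cluster, and returns the chosen indices in document order. The function names and the toy embeddings and scores are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vector], k: int, iters: int = 50) -> List[int]:
    """Lloyd's algorithm; returns a cluster index for every point."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Deterministic farthest-point initialisation (an illustrative choice).
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    assignment = [-1] * len(points)
    for _ in range(iters):
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing, so the algorithm converged
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


def select_summary(embeddings: List[Vector], scores: List[float], k: int) -> List[int]:
    """Pick the highest-scoring sentence of each cluster, in document order."""
    assignment = kmeans(embeddings, k)
    best: Dict[int, int] = {}
    for idx, (cluster, score) in enumerate(zip(assignment, scores)):
        if cluster not in best or score > scores[best[cluster]]:
            best[cluster] = idx
    return sorted(best.values())


# Toy data: sentences 0 and 1 are near-duplicates, as are sentences 2 and 3.
embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.0, 5.0]]
scores = [0.9, 0.8, 0.3, 0.7, 0.5]
summary_indices = select_summary(embeddings, scores, k=3)
```

On this toy input, ranking by score alone would pick sentences 0, 1 and 3, two of which are near-duplicates, whereas the cluster-based selection keeps one sentence from each semantic group.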

Text matching
Zhong et al. 4 were the first to apply text matching to the extractive document summary task. The sentences selected by the clustering algorithm form the candidate sets. Specifically, a candidate set is composed of sentence embeddings from different clusters.
We use the Siamese-BERT architecture to match the reference summary $D$ and the candidate summary $C$. Siamese-BERT consists of two BERTs with tied weights and a cosine-similarity layer used during the inference phase.
We obtain the embeddings of the reference summary $D$ and the candidate summary $C$ using the original BERT. Let $r_D$ and $r_C$ denote the embeddings of the reference summary $D$ and the candidate summary $C$. Their similarity score is measured by $f(D, C) = \cos(r_D, r_C)$.
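The similarity score f(D, C) = cos(r_D, r_C) can be computed as in the sketch below, which assumes the two embeddings are already available as plain lists of floats. The helper `best_candidate` and the toy vectors are hypothetical additions for illustration.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_candidate(r_d: Sequence[float], candidates: List[Sequence[float]]) -> int:
    """Index of the candidate embedding most similar to the reference r_d."""
    return max(range(len(candidates)), key=lambda i: cosine(r_d, candidates[i]))


# Hypothetical embeddings (illustrative values only).
r_d = [1.0, 1.0]
candidate_embeddings = [[1.0, 0.0], [2.0, 2.1], [-1.0, -1.0]]
chosen = best_candidate(r_d, candidate_embeddings)
```

Since cosine similarity ignores vector length, the second candidate is chosen even though its norm differs from that of the reference embedding.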

Experiment

Datasets
We evaluate our approach on the CNN/DailyMail dataset and obtain better performance than the baseline models. The CNN/DailyMail dataset is derived mainly from about one million news reports from CNN and the Daily Mail. It is a single-document summarization corpus containing a large number of articles, each paired with several summary sentences that were lightly edited to form a corpus for text summarization. We follow the standard splits and preprocessing steps used by the baselines; the resulting dataset contains 286,817 articles for training, 13,368 for validation, and 11,487 for testing.

Implementation detail and metrics
We use BERTSUM for sentence selection and sentence encoding. The Adam optimizer with warmup is used, and the learning rate schedule follows Vaswani et al. 14 as: $lr \propto \min(step^{-0.5},\ step \cdot wm^{-1.5})$, where each step processes a batch of size 8 and $wm$ denotes the number of warmup steps, set to 10,000. We choose $r_1 = 0$ and $r_2 = 0.01$. When $r_1 < 0.05$ and $0.005 < r_2 < 0.05$, they have little effect on performance; otherwise they cause performance degradation. We use the validation set to select the three best checkpoints, which are then evaluated on the test set. Importantly, all experimental results reported in this paper are the average of three runs.
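The warmup schedule of Vaswani et al. increases the learning rate linearly for the first wm steps and then decays it with the inverse square root of the step number. A minimal sketch follows; the function name `warmup_lr` and the `scale` parameter are assumptions, since the constant factor in front of the schedule is not stated in this section.

```python
def warmup_lr(step: int, warmup: int = 10_000, scale: float = 1.0) -> float:
    """Learning rate at a given step: linear warmup, then inverse-sqrt decay.

    `scale` is a hypothetical constant factor (not given in the text).
    """
    if step < 1:
        raise ValueError("step must be at least 1")
    return scale * min(step ** -0.5, step * warmup ** -1.5)


# The two branches of the min meet at step == warmup, where the rate peaks.
peak = warmup_lr(10_000)
early = warmup_lr(5_000)    # still in the linear warmup phase
late = warmup_lr(40_000)    # in the decay phase
```

With wm = 10,000 the rate grows until step 10,000 and then halves each time the step count quadruples.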

Baselines and comparisons
To validate the effectiveness of the ClusterSum model, we choose baseline models from two groups: on the one hand, classic methods; on the other hand, state-of-the-art deep learning models. Based on these considerations, we compare our model with the following strong baselines for text summarization:
LEAD-3: The method takes the first three sentences of the document as the summary.
JECS: The method first selects sentences and then compresses each sentence by removing unnecessary words.
LSTMPN: An extractive model based on LSTM and a pointer network.
LongformerExt: An extractive model based on the long-document Transformer (Longformer).
BERTSUMEXT: A method based on the pretrained model BERT.
MatchSum: A model that treats extractive summarization as a text matching task.
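Of the baselines listed above, LEAD-3 is simple enough to state directly in code. The sketch below assumes the document has already been split into sentences; the function name `lead_k` and the example sentences are hypothetical.

```python
from typing import List


def lead_k(sentences: List[str], k: int = 3) -> List[str]:
    """Return the first k sentences of a document as its summary."""
    return sentences[:k]


# Hypothetical pre-split document (illustrative text only).
document = [
    "A storm hit the coast on Monday.",
    "Thousands of homes lost power.",
    "Officials opened emergency shelters.",
    "Repairs are expected to take a week.",
]
lead3_summary = lead_k(document)
```

Documents with fewer than three sentences are returned unchanged, because slicing past the end of a list is safe in Python.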

Results
We test our model on the CNN/DailyMail dataset, using ROUGE 15 as the evaluation metric; the results are reported in Table 1. The results show that our model outperforms all the baseline methods on ROUGE-1, ROUGE-2 and ROUGE-L. Compared with MatchSum, all three scores increase, from which we conclude that the Cluster model is effective for extractive summarization. These three metrics evaluate the overlap of unigrams, bigrams and the longest common subsequence, respectively.
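To make the three metrics concrete, the following sketch computes F1 variants of ROUGE-N and ROUGE-L on whitespace tokens. It is a simplified illustration, not the official ROUGE toolkit: it omits stemming, stopword handling and the summary-level LCS used for multi-sentence summaries. All function names are hypothetical.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap: int, cand_total: int, ref_total: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate: List[str], reference: List[str], n: int) -> float:
    """Clipped n-gram overlap F1 between candidate and reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_len(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate: List[str], reference: List[str]) -> float:
    """LCS-based F1 between candidate and reference."""
    return f1(lcs_len(candidate, reference), len(candidate), len(reference))


cand_tokens = "the cat sat on the mat".split()
ref_tokens = "the cat lay on the mat".split()
r1 = rouge_n(cand_tokens, ref_tokens, 1)
r2 = rouge_n(cand_tokens, ref_tokens, 2)
rl = rouge_l(cand_tokens, ref_tokens)
```

In this example the two sentences share five of six unigrams, three of five bigrams, and a longest common subsequence of five tokens.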

Ablation study
We performed ablation experiments to examine whether the K-means algorithm contributes to the ClusterSum model. The experiments are conducted on the CNN/DailyMail dataset. The model variant is as follows: ClusterSum/K-means: the ClusterSum model without the K-means algorithm. To determine the influence of K-means on the model, we removed the K-means module. Specifically, the sentences scored by Score-BERT are sorted by score, and the three highest-scoring sentences are selected as the summary.
Table 2 shows the performance difference caused by the K-means algorithm. We can see that our model performs well on the CNN/DailyMail news dataset. Combining the advantage of the K-means algorithm in reducing the semantic redundancy of sentences with the experimental results, we conclude that the K-means algorithm reduces semantically repeated sentences and thereby improves the quality of the extracted summary.
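The ablated variant reduces to selecting the top three sentences by score, which can be sketched as below. The function name `top_k_by_score` and the toy scores are hypothetical, and returning the indices in document order is an assumption made for readability.

```python
from typing import List


def top_k_by_score(scores: List[float], k: int = 3) -> List[int]:
    """Indices of the k highest-scoring sentences, in document order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical Score-BERT outputs for a five-sentence document.
sentence_scores = [0.9, 0.8, 0.3, 0.7, 0.5]
ablation_summary = top_k_by_score(sentence_scores)
```

Because this variant looks only at scores, nothing prevents it from selecting several sentences that say nearly the same thing, which is the behaviour the K-means module is meant to avoid.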

Conclusion
Focusing on the high semantic redundancy problem in the extractive document summary task, we showed how a clustering algorithm can be useful in the EDS task by reducing semantically redundant sentences. We introduced a novel way to select sentences with low semantic redundancy to improve the performance of the EDS model. The experimental results show that the proposed model achieves higher ROUGE scores than the baseline methods, which indicates that the clustering algorithm is effective in reducing semantic redundancy.
In future work, we will continue to explore approaches to reducing semantic redundancy in the EDS task, in particular how different clustering algorithms influence the sentences selected by the EDS model.
The results of our Cluster-BERT model, which is based on the K-means algorithm, are presented in Table 1 alongside strong baselines with different learning methods.

Table 1.
ROUGE scores measure the summarization quality. ROUGE-1, ROUGE-2 and ROUGE-L report the unigram, bigram and longest common subsequence overlap with the reference summaries.

Table 2.
The results of ablation study.